Adapted with permission from STAT 301 Project by: Justin Wong, Kevin Yu, Zhuoran (Serena) Feng, Fiona Chang

1 Summary

In this project, we perform a data analysis to determine the factors that best predict the productivity of a garment factory. Using forward selection and LASSO, we compare different models and determine which factors best explain the actual productivity of the garment factory. Furthermore, we discuss the implications of our results and the limitations of the project, and propose future questions that build on it.

2 Introduction

The trillion-dollar garment industry is labor-intensive and low-skilled, and it is largely fueled by the production and performance of employees who work in manufacturing companies (Hamja, Maalouf, and Hasle 2019). As the industry is driven by ever-changing consumer demands and fashion trends, the need for manual processes is inevitable. Through statistical inference, we seek to dig deeper into the relationship between important attributes of the garment manufacturing process and its employees’ productivity by asking the following question: What factors affect the productivity of a garment factory?

The studies “Enhancing Efficiency and Productivity of Garment Industry by Using Different Techniques” (Rajput et al. 2018) and “The Effect of Lean on Occupational Health and Safety and Productivity in the Garment Industry” (Hamja, Maalouf, and Hasle 2019) help frame our exploration of this data set and provide useful context on the garment industry.

2.1 Attribute Information

The data set we will use, called Productivity Prediction of Garment Employees, is sourced from Kaggle.com (Siri 2021) and outlines the following variables that will guide us in answering our question:

  • date: Date in MM-DD-YYYY
  • quarter: A portion of a month, where each month was divided into 4 quarters
  • department: Associated department
  • day: Day of the week
  • team: Associated team number
  • targeted_productivity: Daily target productivity set by authority
  • smv: Standard Minute Value; allocated time for a task
  • wip: Work in progress; includes number of unfinished items for products
  • over_time: Amount of overtime by each team (minutes)
  • incentive: Amount of financial incentive in BDT (Bangladeshi currency)
  • idle_time: Amount of time where production was interrupted
  • idle_men: Number of workers idle due to interrupted production
  • no_of_style_change: Number of style changes
  • no_of_workers: Number of workers in given team
  • actual_productivity: Actual % of productivity delivered

From this list, date and team were excluded from our analysis because they are identifiers for the observation and hence not of interest for our research question.

3 Methods and Results

3.1 Preliminary Results

Table 3.1: Initial Dataset
date quarter department day team targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change no_of_workers actual_productivity
1/1/2015 Quarter1 sweing Thursday 8 0.80 26.16 1108 7080 98 0 0 0 59.0 0.9407254
1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NA 960 0 0 0 0 8.0 0.8865000
1/1/2015 Quarter1 sweing Thursday 11 0.80 11.41 968 3660 50 0 0 0 30.5 0.8005705
Table 3.2: Modified Dataset
department day targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change actual_productivity half
sewing Weekday 0.80 26.16 1108 7080 98 0 0 0 0.9407254 Half1
finishing Weekday 0.75 3.94 0 960 0 0 0 0 0.8865000 Half1
sewing Weekday 0.80 11.41 968 3660 50 0 0 0 0.8005705 Half1

Both of the categorical variables day and quarter were edited because they have more than two levels, which leads to difficulties in conducting forward selection. For this reason, day was changed into two levels: Weekday and Weekend. Similarly, quarter was used to create the variable half with two levels.
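The recoding described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the report does not state which days count as Weekend or which quarters make up each half, so the two sets below are assumptions. The department spelling fix and the replacement of missing wip values by 0 follow what Tables 3.1 and 3.2 show.

```python
# Sketch of the recoding step. The Weekend days and the quarter-to-half
# mapping below are ASSUMPTIONS; the report does not state them.
WEEKEND_DAYS = {"Saturday", "Sunday"}            # assumed
FIRST_HALF_QUARTERS = {"Quarter1", "Quarter2"}   # assumed


def recode(row):
    """Return a copy of one observation with two-level day and half."""
    out = dict(row)
    # Fix the misspelled department label seen in the raw data.
    if out["department"] == "sweing":
        out["department"] = "sewing"
    # Missing work-in-progress counts appear as 0 in the modified table.
    if out["wip"] is None:
        out["wip"] = 0
    out["day"] = "Weekend" if row["day"] in WEEKEND_DAYS else "Weekday"
    out["half"] = "Half1" if row["quarter"] in FIRST_HALF_QUARTERS else "Half2"
    del out["quarter"]
    return out


# First two rows of Table 3.1, reduced to the affected columns.
raw = [
    {"quarter": "Quarter1", "department": "sweing", "day": "Thursday", "wip": 1108},
    {"quarter": "Quarter1", "department": "finishing", "day": "Thursday", "wip": None},
]
modified = [recode(r) for r in raw]
```

Under these assumed mappings, the two rows come out as sewing/Weekday/Half1 and finishing/Weekday/Half1, in line with the first two rows of Table 3.2.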

Figure 3.1: ggpairs Plot

From this plot, we can analyze the correlation values between the variables that we are using in our analysis. Based on the correlation values, there appears to be correlation between input variables, which will be addressed later in the analysis. Variables with relatively high correlations (over 0.65) include no_of_workers and smv, no_of_workers and over_time, and over_time and smv.

These high correlations indicate that the dataset has an issue of multicollinearity. To address this, the variable no_of_workers will be removed from our analysis.
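The screening rule used here (flag any pair of input variables whose absolute correlation exceeds 0.65) can be sketched as follows. This is a minimal illustration rather than the project's code: it uses only the three rows shown in Table 3.1, so the correlations it produces are not the values from Figure 3.1.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# The three rows of Table 3.1 only; far too few for real inference.
columns = {
    "smv": [26.16, 3.94, 11.41],
    "over_time": [7080, 960, 3660],
    "no_of_workers": [59.0, 8.0, 30.5],
}

THRESHOLD = 0.65  # cut-off quoted in the text

flagged = []
for a, b in combinations(columns, 2):
    r = pearson(columns[a], columns[b])
    if abs(r) > THRESHOLD:
        flagged.append((a, b, r))
```

On the full data set, the same loop would run over every pair of numeric input variables.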

Figure 3.2: Actual Productivity by Day of the Week Boxplot

Figure 3.3: Actual Productivity by Half Boxplot

Figure 3.4: Actual Productivity of Departments Boxplot

The above boxplots show the medians and variances of the discrete factors of interest. Since the above plots show that there are differences in the medians and variances of the factors of interest, it is justified to keep these factors in our analysis for future investigation.

Figure 3.5: Distribution of Actual Productivity

Since the distribution of actual productivity does not appear to be normal and is slightly left-skewed, we will likely need to assume normality in our analysis.

Figure 3.6: Q-Q Plot of Actual Productivity

The above Q-Q plot was used to determine whether a normality assumption on our response variable is valid, since it is an assumption required for tests used later on. There appear to be tails on both ends, which suggests left-skewness of the data. Unfortunately, although we tried different transformations of the data, none of them improved the skewness seen in this Q-Q plot. Based on what we have learned in this class, we will therefore have to assume normality of the data, even though this is a major stretch.
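For readers unfamiliar with how a normal Q-Q plot is built, the sketch below pairs each ordered, standardized observation with the standard normal quantile at the same plotting position. It is an illustration only: the plotting position (i - 0.5)/n is one common convention and an assumption here, and the nine productivity values are simply those visible in the tables of this report, not the full sample.

```python
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    n = len(sample)
    mean = sum(sample) / n
    sd = (sum((v - mean) ** 2 for v in sample) / (n - 1)) ** 0.5
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(sorted(sample), start=1):
        p = (i - 0.5) / n  # assumed plotting position
        points.append((standard_normal.inv_cdf(p), (value - mean) / sd))
    return points


# actual_productivity values visible in Tables 3.1 and 3.4 (illustrative only).
productivity = [
    0.9407254, 0.8865000, 0.8005705,
    0.7506510, 0.8001171, 0.9617845,
    0.7551667, 0.7530975, 0.7059167,
]
points = qq_points(productivity)
```

If the data were normal, the pairs would fall close to the line y = x; systematic departures at either end correspond to the tails discussed above.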

Table 3.3: Summary of Actual Productivity for Halves and Departments
half department count mean median min max sd
Half1 finishing 296 0.7616317 0.8096023 0.2380417 1.096633 0.1823244
Half1 sewing 399 0.7374970 0.7999632 0.2337055 1.100484 0.1522631
Half2 finishing 210 0.7407145 0.7862321 0.2357955 1.120437 0.2159050
Half2 sewing 292 0.7008552 0.7500680 0.2494167 1.000457 0.1559532

This table provides relevant summaries of our data, split into different halves and departments. Overall, the data seems to be relatively consistent, with a few things to note:

  • The relative means and medians are fairly consistent throughout the different departments and halves.
  • The standard deviation for the finishing department is slightly larger than that of the sewing department.

3.2 Methods: Plan

We consider the data set trustworthy and reliable because it has been used in multiple published academic papers (Al Imran et al. 2019; Imran, Rahim, and Ahmed 2021).

Using the data set, we plan to analyze which factors are the most important for productivity. Linear regression will be used to determine the best inference model for the actual productivity of the factory. Using forward selection and LASSO, we plan to compare different models and determine which factors best explain the actual productivity of the garment factory. Additionally, we plan to test our optimal inference model’s performance by splitting the data into training and testing sets and comparing the corresponding adjusted \(R^2\) values with those of the full model.
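A random training/testing split of the kind planned here might look like the sketch below. The 75% training fraction is an inference, not a stated fact: the counts in Table 3.3 sum to 1197 observations, and the residual degrees of freedom in Table 3.8 imply 897 training rows. The seed is arbitrary, so this does not reproduce the authors' actual split.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=301):
    """Shuffle a copy of rows and cut it into training and testing parts."""
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Stand-in row identifiers for the 1197 observations counted in Table 3.3.
row_ids = list(range(1197))
train, test = train_test_split(row_ids)
```

With 1197 rows and a 0.75 fraction, this yields 897 training and 300 testing observations.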

We expect that factors such as number of members on the team and targeted productivity may have a higher association with actual productivity. Thus, we expect these factors to be present in the best model for explaining the relationship with the actual productivity.

The results from this report could provide insights to companies in the garment manufacturing sector. Knowing which factors may increase productivity is crucial for any successful business.

3.3 Results

3.3.1 Generating the Model with Forward Selection

Table 3.4: Top 3 Rows of Training Data
department day targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change actual_productivity half
sewing Weekend 0.75 18.79 1193 3960 45 0 0 0 0.7506510 Half1
sewing Weekend 0.80 26.16 1128 10620 63 0 0 0 0.8001171 Half2
finishing Weekend 0.75 3.94 0 1620 0 0 0 0 0.9617845 Half2
Table 3.4: Top 3 Rows of Testing Data
department day targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change actual_productivity half
finishing Weekday 0.75 3.94 0 960 0 0 0 0 0.7551667 Half1
sewing Weekday 0.75 19.87 733 6000 34 0 0 0 0.7530975 Half1
finishing Weekday 0.65 3.94 0 960 0 0 0 0 0.7059167 Half1
Table 3.5: Evaluation Metrics for Forward Selection
n_input_variables RSQ RSS ADJ.R2
1 0.1945884 21.49255 0.1936885
2 0.2164887 20.90814 0.2147359
3 0.2281308 20.59747 0.2255377
4 0.2310129 20.52056 0.2275645
5 0.2337486 20.44755 0.2294487
6 0.2377994 20.33946 0.2326610
7 0.2384121 20.32311 0.2324154
8 0.2389621 20.30843 0.2321059
9 0.2392325 20.30121 0.2315134
10 0.2394983 20.29412 0.2309148
11 0.2394995 20.29409 0.2300469

Adjusted \(R^2\) was chosen as the metric for model selection because it is best suited to our inference model. Adding a variable can never increase the RSS, so \(R^2\) always favors the larger model; adjusted \(R^2\) compensates for this by penalizing the number of input variables, making it a more suitable metric than \(R^2\).
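As a check on Table 3.5, the adjusted \(R^2\) column can be recomputed from the RSQ column with the usual formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\). The training sample size n = 897 is not stated in the report; it is inferred here from the 885 residual degrees of freedom of the full model plus its 12 estimated coefficients.

```python
N_TRAIN = 897  # inferred: 885 residual df + 12 coefficients in the full model

# RSQ column of Table 3.5, keyed by number of input variables.
r_squared = {
    1: 0.1945884, 2: 0.2164887, 3: 0.2281308, 4: 0.2310129,
    5: 0.2337486, 6: 0.2377994, 7: 0.2384121, 8: 0.2389621,
    9: 0.2392325, 10: 0.2394983, 11: 0.2394995,
}


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p input variables fitted to n rows."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


adjusted = {p: adjusted_r2(r2, N_TRAIN, p) for p, r2 in r_squared.items()}
best_size = max(adjusted, key=adjusted.get)
```

While RSQ increases with every added variable, the adjusted values peak at an intermediate model size, which is why the latter is used for selection.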

As shown by the table above, the model with 6 input variables has the highest adjusted \(R^2\), so it is chosen as the optimal model. However, its adjusted \(R^2\) is only marginally higher than that of many of the other models, suggesting that it may be only slightly better than the others. The selected model will be compared with the full model to test this observation.

The variables selected in the model with 6 input variables were: targeted_productivity, smv, wip, incentive, idle_men, and no_of_style_change.
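The forward selection procedure itself can be sketched in plain Python: at each step, add the variable that lowers the RSS the most, then record the adjusted \(R^2\) of the resulting model. This is a generic illustration of the technique, not the authors' implementation, and the data (x1, x2, x3, y) are made-up numbers in which y depends mainly on x1.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def rss(columns, y):
    """RSS of the least-squares fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_selection(data, y):
    """Return [(variables, adjusted R^2)] for model sizes 1..len(data)."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)
    chosen, remaining, path = [], list(data), []
    while remaining:
        # Greedy step: add the variable giving the smallest RSS.
        best = min(remaining,
                   key=lambda v: rss([data[c] for c in chosen + [v]], y))
        chosen.append(best)
        remaining.remove(best)
        p = len(chosen)
        model_rss = rss([data[c] for c in chosen], y)
        adj = 1 - (model_rss / (n - p - 1)) / (tss / (n - 1))
        path.append((list(chosen), adj))
    return path


# Hypothetical data: y is roughly 3 + 2 * x1 plus small noise.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 0, 1, 0, 0, 1, 1, 0],
    "x3": [5, 3, 8, 1, 9, 2, 7, 4],
}
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.2, -0.15, 0.0]
y = [3 + 2 * x + e for x, e in zip(data["x1"], noise)]
path = forward_selection(data, y)
best_variables, best_adj = max(path, key=lambda step: step[1])
```

Solving the normal equations directly is adequate for a small, well-conditioned example like this one; statistical software typically uses a more numerically stable decomposition.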

Table 3.6: Forward Selection Model Summary
term estimate std.error statistic p.value
(Intercept) 0.2425582 0.0393780 6.159742 0.0000000
targeted_productivity 0.6992923 0.0520280 13.440690 0.0000000
smv -0.0012414 0.0005188 -2.392951 0.0169199
wip 0.0000076 0.0000035 2.174850 0.0299041
incentive 0.0000677 0.0000357 1.896537 0.0582127
idle_men -0.0066583 0.0015499 -4.296067 0.0000193
no_of_style_change -0.0369644 0.0127146 -2.907235 0.0037369

Adjusted \(R^2\) for Selected Model: 0.2327

Figure 3.7: Residuals of Selected Model

Figure 3.8: QQ-plot of Selected Model

The residual plot and the Q-Q plot suggest some violations of the assumptions needed for our analysis. The residual plot shows slight heteroskedasticity within our model, which violates our equal variance assumption, and the Q-Q plot suggests a violation of our normality assumption.

Table 3.7: Summary Table for Full Model
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.2436616 0.0407603 5.9779105 0.0000000 0.1636634 0.3236597
departmentsewing 0.0169580 0.0219195 0.7736501 0.4393443 -0.0260623 0.0599783
dayWeekend 0.0061356 0.0108010 0.5680563 0.5701408 -0.0150630 0.0273341
targeted_productivity 0.6988608 0.0526825 13.2655250 0.0000000 0.5954636 0.8022580
smv -0.0018777 0.0009822 -1.9117663 0.0562289 -0.0038055 0.0000500
wip 0.0000071 0.0000036 1.9496519 0.0515331 0.0000000 0.0000142
over_time -0.0000001 0.0000021 -0.0364780 0.9709094 -0.0000043 0.0000041
incentive 0.0000662 0.0000359 1.8430641 0.0656539 -0.0000043 0.0001367
idle_time 0.0003560 0.0004497 0.7916867 0.4287555 -0.0005266 0.0012387
idle_men -0.0077306 0.0020215 -3.8242258 0.0001404 -0.0116981 -0.0037632
no_of_style_change -0.0351849 0.0135110 -2.6041730 0.0093640 -0.0617022 -0.0086676
halfHalf2 -0.0057576 0.0104672 -0.5500580 0.5824184 -0.0263009 0.0147858

Adjusted \(R^2\) for Full Model: 0.23

The adjusted \(R^2\) of the selected model (0.2327) is only slightly larger than that of the full model (0.23). This further suggests that the selected model is not substantially better than the full model. An F-test will be conducted to test this observation.

Table 3.8: F-test for Full and Selected Model
Res.Df RSS Df Sum.of.Sq F Pr..F.
890 20.33946 NA NA NA NA
885 20.29409 5 0.0453663 0.3956736 0.8519733

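Since the p-value is 0.852, at a 5% significance level, there is not enough evidence to reject the null hypothesis that the coefficients of the five additional variables in the full model are all zero. In other words, the fits of the selected model and the full model do not differ significantly.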

Table 3.9: Summary Table for Selected Model with Testing Data
term estimate std.error statistic p.value
(Intercept) 0.2689469 0.0788155 3.412359 0.0007347
targeted_productivity 0.6689556 0.1052842 6.353807 0.0000000
smv -0.0011865 0.0009566 -1.240232 0.2158819
wip 0.0000077 0.0000067 1.138709 0.2557548
incentive 0.0000563 0.0000465 1.210549 0.2270439
idle_men -0.0099292 0.0030339 -3.272772 0.0011924
no_of_style_change -0.0285627 0.0243788 -1.171623 0.2423002

Adjusted \(R^2\) for Selected Model Using Testing Data: 0.1702

The adjusted \(R^2\) suggests that about 17% of the variation in the response is explained by the model, after adjusting for the number of input variables. This indicates that the model performs fairly poorly. However, since it was the best model according to our analysis, this may suggest that further exploration of this topic is needed. Thus, model selection using LASSO is conducted next.

3.3.2 Generating the Model with LASSO

Figure 3.9: Lambda Selection by CV with LASSO

The variables selected in the model by LASSO were: targeted_productivity, smv, incentive, idle_men, and no_of_style_change.
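To illustrate how LASSO drops variables, the sketch below implements coordinate descent with soft-thresholding on standardized inputs, minimizing \(\frac{1}{2n}\sum_i (y_i - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j |\beta_j|\). It is a generic illustration of the technique, not the software used in the project: the data are made up, the penalty value 0.5 is arbitrary (the report chose lambda by cross-validation), and the coefficients are on the standardized scale.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(col):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]


def lasso(columns, y, lam, sweeps=200):
    """Coordinate-descent LASSO on standardized columns and centered y."""
    n = len(y)
    xs = [standardize(c) for c in columns]
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]  # residual for beta = 0
    beta = [0.0] * len(xs)
    for _ in range(sweeps):
        for j, xj in enumerate(xs):
            # Average product of x_j with the residual that excludes x_j.
            rho = sum(x * (r + beta[j] * x) for x, r in zip(xj, resid)) / n
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, xj)]
                beta[j] = new
    return beta


# Hypothetical data: y is roughly 3 + 2 * x1 plus small noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, 0, 1, 0, 0, 1, 1, 0]
x3 = [5, 3, 8, 1, 9, 2, 7, 4]
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.2, -0.15, 0.0]
y = [3 + 2 * x + e for x, e in zip(x1, noise)]

beta = lasso([x1, x2, x3], y, lam=0.5)
selected = [name for name, b in zip(["x1", "x2", "x3"], beta) if b != 0.0]
```

In this toy example the penalty sets the coefficients of x2 and x3 exactly to zero, which is the mechanism by which LASSO performs variable selection.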

Figure 3.10: Residuals of Selected Model with LASSO

Figure 3.11: QQ-plot of Selected Model with LASSO

Much like the model chosen by forward selection, the residual plot for the model selected by LASSO suggests unequal variance, and the Q-Q plot above suggests a violation of the normality assumption. Attempts to mitigate these issues failed, so we will have to continue our analysis under very strong assumptions about our data.

Adjusted \(R^2\) for Selected Model with LASSO: 0.2294

Table 3.10: F-test for Full and Model Selected with LASSO
Res.Df RSS Df Sum.of.Sq F Pr..F.
891 20.44755 NA NA NA NA
885 20.29409 6 0.1534619 1.11538 0.3512442

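Since the p-value is 0.351, at a 5% significance level, there is not sufficient evidence to reject the null hypothesis that the coefficients of the six additional variables in the full model are all zero. In other words, the fits of the model selected by LASSO and the full model do not differ significantly.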
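The F statistics in Tables 3.8 and 3.10 follow from the usual nested-model formula, \(F = \frac{(RSS_{reduced} - RSS_{full}) / (df_{reduced} - df_{full})}{RSS_{full} / df_{full}}\). The short sketch below recomputes them from the tabulated RSS and residual degrees of freedom; it does not compute the p-values, which come from the F distribution with the corresponding degrees of freedom.

```python
def partial_f(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic comparing a reduced model with the full model it nests in."""
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    denominator = rss_full / df_full
    return numerator / denominator


# Values from Table 3.8 (forward selection model vs. full model).
f_forward = partial_f(20.33946, 890, 20.29409, 885)

# Values from Table 3.10 (LASSO model vs. full model).
f_lasso = partial_f(20.44755, 891, 20.29409, 885)
```

Both statistics agree with the tabulated values (about 0.396 and 1.115) up to rounding of the RSS inputs.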

Table 3.11: Summary Table for Selected Model with Testing Data
term estimate std.error statistic p.value
(Intercept) 0.2655908 0.0788001 3.3704351 0.0008508
targeted_productivity 0.6735619 0.1052595 6.3990577 0.0000000
smv -0.0008335 0.0009055 -0.9204962 0.3580682
incentive 0.0000573 0.0000465 1.2312795 0.2192024
idle_men -0.0099995 0.0030348 -3.2949306 0.0011050
no_of_style_change -0.0297716 0.0243679 -1.2217536 0.2227795

Adjusted \(R^2\) for Selected Model Using Testing Data: 0.1694

The adjusted \(R^2\) suggests that about 16.9% of the variation in the response is explained by the model, after adjusting for the number of input variables. Compared with the value of 0.1702 obtained earlier, this suggests that forward selection performed slightly better in producing an inference model for the actual productivity of the factory.

4 Discussion

4.1 Findings

4.1.1 Model Selection with Forward Selection and LASSO

The variables selected from forward selection in our model were: targeted_productivity, smv, wip, incentive, idle_men, and no_of_style_change. The variables selected from LASSO in our model were: targeted_productivity, smv, incentive, idle_men, and no_of_style_change.

Both models produced fairly poor adjusted \(R^2\) values, 0.1702 and 0.1694 respectively, when evaluated with the testing data. Additionally, neither of the selected models differed significantly from the full model according to the corresponding F-tests.

4.1.2 Limitations

The relatively poor performance of both selected models and the non-significant results from the corresponding F-tests may be due to the assumptions made throughout our analysis. The techniques learned in this class were likely not able to overcome the limitations and assumptions made throughout our analysis. As shown by the various model assumption plots throughout the analysis, the assumptions of equal variance and normality were used when analyzing the response variable, as well as both of the models selected via forward selection and LASSO.

The assumption of normality was required for the F-tests used near the end of our analysis. Because the diagnostic plots shown earlier indicate some violation of this assumption, our tests of whether the selected models were statistically different from the full model may have been affected.

Another factor that may have led to our results being non-significant relative to the full model is that the response variable (actual productivity) lies roughly between 0 and 1, while our explanatory variables have much broader ranges. In addition, the variables wip, incentive, idle_time, and idle_men used in our analysis contained a large number of 0s along with a few large observed values, leading to abnormal distributions of values.

These issues regarding normality and heteroskedasticity in our data and models may have resulted in inaccurate standard errors and thus reduced the precision of the coefficient estimates, as well as producing inaccurate p-values and F-statistics. These limitations may have affected the statistical significance of our results.

4.1.3 Impact

The variables selected by both forward selection and LASSO were targeted_productivity, smv, incentive, idle_men, and no_of_style_change. The models suggest that the daily set productivity, allocated time for a task, financial incentive, number of idle workers, and number of style changes have the strongest association with actual productivity. These findings can inform the business decisions of those in management and leadership positions, as they could potentially adjust each variable to drive the highest amount of productivity, and thus, profit. For example, a higher monetary incentive per item may motivate workers to be more efficient and can have higher payoffs overall, though one should be cautious of unsatisfactory work.

The study “Enhancing Efficiency and Productivity of Garment Industry by Using Different Techniques” relays methods of increasing productivity by eliminating factors such as idle time, which is related to the idle_men variable in our model. These methods include time study, implementing a visual management system, and standardized work procedures, which increased efficiency by 8.07% (Rajput et al. 2018). Focusing on decreasing the number of style changes through process management can also increase productivity, since shorter set-up and transition times between patterns reduce wasted time. “The Effect of Lean on Occupational Health and Safety and Productivity in the Garment Industry” outlines lean methodology, in which waste is minimized by reducing variability on all fronts of production (Hamja, Maalouf, and Hasle 2019). While there is evidence of positive effects on productivity from using lean, the literature also points to a potential negative impact on workers’ health, which morally outweighs monetary profit (Hamja, Maalouf, and Hasle 2019).

4.1.4 Future Questions

This study prompts further questions about the garment industry:

  • How can we adjust these variables to achieve higher productivity?
  • What is the threshold at which productivity is maximized?
  • What other variables outside of this dataset, particularly involving technological innovation, affect productivity?
  • What is the environmental impact of increasing productivity?

References

Al Imran, Abdullah, Md Nur Amin, Md Rifatul Islam Rifat, and Shamprikta Mehreen. 2019. “Deep Neural Network Approach for Predicting the Productivity of Garment Employees.” In 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), 1402–7. IEEE.
Hamja, Abu, Malek Maalouf, and Peter Hasle. 2019. “The Effect of Lean on Occupational Health and Safety and Productivity in the Garment Industry–a Literature Review.” Production & Manufacturing Research 7 (1): 316–34.
Imran, Abdullah Al, Md Shamsur Rahim, and Tanvir Ahmed. 2021. “Mining the Productivity Data of the Garment Industry.” International Journal of Business Intelligence and Data Mining 19 (3): 319–42.
Rajput, Dhanashree, Madhuri Kakde, Pranjali Chandurkar, and PP Raichurkar. 2018. “Enhancing Efficiency and Productivity of Garment Industry by Using Different Techniques.” International Journal on Textile Engineering and Processes 4 (1): 5–8.
Siri, S. 2021. “Productivity Prediction of Garment Employees.” Kaggle. https://www.kaggle.com/datasets/ishadss/productivity-prediction-of-garment-employees.